Abstract

In this paper we are interested in indexing texts for substring matching
queries with one edit error. That is, given a text T of n characters over
an alphabet of size σ, we are asked to build a data structure that answers
the following query: find all the occ substrings of the text which are at
edit distance at most 1 from a string q of length m. We show two new
results for this problem. The first result, suitable for arbitrary
alphabet size σ, uses O(n(log^ε n + log σ)) words of space and answers
queries in time O(m + occ). This improves simultaneously in space and
time over the result of Cole et al. [8]. The second result, suitable only for
constant alphabets, relies on compressed indexes and comes in two variants:
the first variant uses O(n log^ε n) bits of space (where ε is any constant
such that 0 < ε < 1) and answers queries in time O(m + occ), while the
second variant uses O(n log log n) bits of space and answers queries in
time O((m + occ) log log n). This second result improves on the previously
best results for constant alphabets achieved in Lam et al. [16] and Chan
et al. [5].

1 Introduction 

The problem of approximate string matching over texts has been intensively
studied. Given a pattern q, a text T (the characters of T and q are drawn
from the same alphabet of size σ) and a parameter k, the problem consists in
finding all the substrings of T which are at distance at most k from q. Many
different distances can be used for this problem. In this paper, we are
interested in the edit distance, in which the distance between two strings x and y
is defined as the minimal number of edit operations needed to transform x into
y, where the considered edit operations are deletion of a character, substitution
of a character by another and insertion of a character at some position
in the string. Generally two flavors of the problem are considered: the online
variant and the indexed variant. In the online variant, we assume that we know
the pattern in advance and the text arrives character by character. In our case
we are interested in the offline version in which we can preprocess the text in
advance so that we can efficiently answer queries which consist only of the
pattern and a parameter k. Further, we restrict our interest to the case k = 1.

Data structure        Space usage (in bits)    Query time
Lam et al. [16]       O(n)                     O((m log log n + occ) log^ε n)
Lam et al. [16]       O(n log log n)           O((m log log n + occ) log log n)
Lam et al. [16]       O(n log^ε n)             O(m log log n + occ)
Chan et al. [5]       O(n)                     O(m + poly(log n) + occ log^ε n)
Chan et al.           O(n log n)               O(m + log n log log n + occ)
This paper            O(n log log n)           O((m + occ) log log n)
This paper            O(n log^ε n)             O(m + occ)

Table 1: Comparison of existing solutions for constant alphabets

Data structure          Space usage (in bits)            Query time
Cole et al. [8]         O(n log^2 n)                     O(m + log n log log n + occ)
Buchsbaum et al. [4]    O(n log^2 n)                     O(m log log n + occ)
This paper              O(n log n (log^ε n + log σ))     O(m + occ)

Table 2: Comparison of existing solutions for arbitrary alphabets
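
To make the query semantics for k = 1 concrete, the following minimal sketch tests whether two strings are within edit distance 1 and uses that test as a brute-force reference for the indexed query. It is only an illustration of the problem statement, not one of the indexes of this paper; the function names and the 0-based (start, length) output convention are our own assumptions.

```python
def within_one_edit(x, y):
    """Return True if the edit distance between x and y is at most 1."""
    if abs(len(x) - len(y)) > 1:
        return False
    # Skip the longest common prefix of the two strings.
    i = 0
    while i < len(x) and i < len(y) and x[i] == y[i]:
        i += 1
    if len(x) == len(y):
        return x[i + 1:] == y[i + 1:]   # at most one substitution
    if len(x) < len(y):
        return x[i:] == y[i + 1:]       # one insertion into x
    return x[i + 1:] == y[i:]           # one deletion from x


def brute_force_matches(text, q):
    """All (start, length) substrings of text within edit distance 1 of q (0-based)."""
    m = len(q)
    hits = set()
    for length in (m - 1, m, m + 1):
        if length < 1:
            continue
        for start in range(len(text) - length + 1):
            if within_one_edit(q, text[start:start + length]):
                hits.add((start, length))
    return hits
```

Such a quadratic-time scan is only useful as an oracle against which a real index can be tested.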

1.1 Related work 

We now mention the best results from the literature we are aware of for our
problem. We only consider the results with worst case space and time bounds.
We thus do not consider results like the result in [17] in which either the query
time or the space usage only holds on average, on the assumption that the text
and/or the patterns are drawn from some random distribution. For general
integer alphabets, a result by Amir et al. [1], further improved by Buchsbaum
et al. [4], has led to O(n log^2 n) space with query time O(m log log m + occ).
Later Cole et al. [8] described an index for an arbitrary number of errors k
which, for the case k = 1, uses O(n log^2 n) bits of space and answers queries in
O(m + log n log log n + occ) time. For the special case of constant sized alphabets,
a series of results culminated with the results of Lam et al. [16], Chan et al. [5]
and Chan et al. [6], with various tradeoffs between the occupied space and query
time. By adapting some ideas of [2] and combining them with two indices described
in [S], we are able to essentially remove the additive polylogarithmic term from
the query times associated with some of the best previously known results while
using the same space (or even less in some cases). The reader can refer to
Tables 1 and 2 for a full comparison between our new results and the old ones.

As can be seen, both our indexes improve on the state of the art. We should
mention that the results in the table attributed to [16] are not stated in that
form, but can be easily deduced from the main result in [16] by using different
compressed text index implementations. The results in the first table are all
unsuitable for large alphabets as their query times all have a hidden linear
dependence on σ. That means that for very large alphabets of size σ = Θ(√n) or
of size σ = 2^{Θ(log^ε n)} for example, the query time of those algorithms will be
an unreasonable Ω(√n) or Ω(2^{log^ε n}). By contrast the query times of the algorithms
in the second table do not have any dependence on the alphabet size. Our result
in the second table always dominates the Cole et al. result for both space and
time. Only in the case of a very large alphabet (e.g. σ = Θ(n) or σ = Θ(√n)) and
long pattern length m = Ω(log n log log n) will both have the same space and
time. In the other cases, our data structure will dominate. For example in case
σ = 2^{Θ(log^ε n)}, our data structure will use O(n log^{1+ε} n) bits of space while the
data structure of Cole et al. still uses O(n log^2 n) bits. Another example is the
case m < log n, where the query time of Cole et al. will be Ω(m log log n + occ)
while our query time will still be just O(m + occ).

2 Preliminaries and outline of the results 

At the core of our paper is a result for indexing all substrings of a text T of n
characters whose length is bounded by some given length b. In particular, we show
the following two theorems:

Theorem 1 For any text T of length n characters over an alphabet of fixed
size, given a parameter b, we can build an index of size O(n(b^ε + log^ε n)) bits
(where ε is any constant such that 0 < ε < 1) so that for any given string q of
length m < b we can report all of the occ substrings of the text which are at
edit distance 1 from q in time O(m + occ). Alternatively we can build a data
structure which occupies O(n(log b + log log n)) bits of space and which answers
queries in time O((m + occ)(log b + log log n)).

The second theorem is based on the approach of Cole et al. [8] combined with some
ideas used in the first theorem and with two recent results of [3] and [14].

Theorem 2 For any text T of length n characters over an alphabet of size σ,
given a parameter b, we can build an index of size O(n log n (b^ε + log σ)) bits
(where ε is any constant such that 0 < ε < 1) such that for any given string q
of length m < b we can report all of the occ substrings of the text which are at
edit distance 1 from q in time O(m + occ).

The first theorem gives an immediate improvement for constant sized alphabets
when combined with a result appearing in [S]:

Theorem 3 For any text T of length n over an alphabet of size σ we can build
the following indices which are able to return for any query string q the occ
occurrences of substrings of T which are at edit distance 1 from q:

• An index which assumes that σ = O(1) and which occupies O(n log^ε n)
bits of space and answers queries in time O(m + occ), where ε is any
constant such that 0 < ε < 1.

• An index which assumes that σ = O(1) and which occupies O(n log log n)
bits of space and answers queries in time O((m + occ) log log n).

The second theorem can also be combined with another result which has
appeared in [S] to get the following result suitable for arbitrary alphabet sizes:

Theorem 4 For any text T of length n over an alphabet of size σ we can build
an index which occupies only O(n(log^ε n + log σ)) bits of space and is able
to return for any query string q the occ occurrences of substrings of T which are
at edit distance 1 from q in time O(m + occ).

The second theorem can be used for any alphabet size while the first one holds
only for a fixed alphabet. Our two theorems can be used only for matching strings
of bounded length but can provide an improvement when used in combination
with previous results which are efficient only for long strings.

Our new methods for proving Theorems 1 and 2 make use of some ideas
introduced in [2] combined with tools which were recently proposed in [3] and [14].
In [2] a new dictionary for approximate queries with one error was proposed.
For a string of length m it achieved O(m + occ) query time while at the same
time using optimal space (up to constant factors). A naive use of that dictionary
for the problem of full-text indexing was also proposed in that paper. However,
while this leads to the same O(m + occ) query time achieved in this paper, the
space usage was too large, namely O(n(log n log log n)^2 log σ) bits of space for
an alphabet of size σ. Nonetheless we will borrow some ideas from that paper and
use them to prove our main results.

The paper is organized as follows: we first begin with the data structure
suitable for constant sized alphabets (Theorems 1 and 3) in Section 3 before
showing the data structure for large alphabets (Theorems 2 and 4) in the
following section. We finally conclude the paper by mentioning some open problems.

2.1 Model and notation 

In the remainder we denote by x̄ the reverse of the string x. That is, x̄ is the
string x written in reverse order. For a given string s, we denote by s[i, j] or by
s[i..j] the substring of s spanning the characters i through j. We assume that the
reader is familiar with the trie concept and with classical text indexing data
structures like suffix trees and suffix arrays. The model assumed in this paper is
the word RAM model with word length w = O(log n) where n is the size of the
considered problem. We further assume that standard arithmetic operations
including multiplications can be computed in constant time. We assume that
our text T to be indexed is of length n and its alphabet is of size σ < n.¹

¹ Throughout the paper we always assume that σ < n, as otherwise we can store an
auxiliary hash based dictionary with space usage O(n log σ) which stores all characters
appearing in T; then at query time, if we meet a character absent from the dictionary
at some position i in the pattern, we know that the error occurs at that position and
we can easily find the matching of the pattern using the indexes of this paper, by
considering substitution or deletion of that character.

2.2 Basic definitions

We now briefly recall some basic text indexing data structures.

Suffix array. A suffix array [18] (denoted SA[1..n]) built on a text T of length
n just stores the pointers to the suffixes of T in sorted order (where the order
is the usual lexicographic order). Clearly a suffix array occupies n log n bits.

Suffix tree. A suffix tree [20] is a specially built tree on a text T where:

• Every suffix of T is associated with a leaf in the tree.

• A factor p of T is associated with an internal node in the tree iff there
exist two characters a and b such that pa and pb are also factors of T.

• The subtree rooted at any internal node associated with a factor p contains
all the suffixes of T which have p as a prefix.

• An edge connecting an internal node associated with a factor p to a node
(which could be a leaf or an internal node) associated with a string s = ps'
(which could either be a suffix or another factor) will be labeled with the
string s'.

For a more detailed description of the suffix array or the suffix tree, the reader
can refer to any book on text indexing algorithms. The essential property of a
suffix tree is that it can be implemented in such a way that it occupies O(n) pointers
(that is O(n log n) bits) in addition to the text, and that given any factor p of T
it is possible to find all the suffixes of T which are prefixed by p in O(|p|) time.
A suffix tree can also be augmented in several ways so as to support many other
operations, but in this paper we will use very few of them.

3 Data structure for constant sized alphabets

In this section we give a proof of Theorems 1 and 3. For that we first begin by
describing the data structures used in Theorem 1 in Section 3.1, then describe
how the queries are executed on those data structures in Section 3.2, which
concludes the demonstration of Theorem 1. Finally Theorem 3 is proved in Section
3.3.

3.1 Data structure for short patterns

Our data structure for short patterns relies on a central idea used in [2]. The
idea was that, using a hash based dictionary, checking each candidate string at
distance 1 from a pattern q of length m against the dictionary can be done in
O(1) time after an O(m) time preprocessing of q. This stems from two facts:
1. If the dictionary uses some suitable hash function H then, after a
preprocessing step done in O(m) time, the computation of H(p) takes
constant time for any string p at distance 1 from q.

2. Using a trie and a reverse trie, we can verify any match in constant time
(this idea was used many times before).

In our case we will use different techniques. As we are searching in a text rather
than a dictionary, we will be looking for suffixes prefixed by some string p instead
of finding exact matching entries in a dictionary. For that reason we will use a
weak prefix search data structure which, while still using reasonable space, will
permit us to look for matching suffixes very quickly.

We now describe in more detail the data structure we use for the matching
of patterns of bounded length over small alphabets (Theorem 1). This data
structure uses the following components:

1. A suffix array SA built on the text T.

2. A suffix tree S built on the text T. In each node of the suffix tree
representing a factor p of T we store the range of suffixes which start with p.
That is, we store a range [i, j] such that any suffix starts with p iff its rank
k in lexicographic order is included in [i, j].

3. A reverse suffix tree S̄ built on the text T̄, the reverse of the text T
(we could call S̄ a prefix tree as it actually stores prefixes of T). In each
node of S̄ representing a factor p̄ we store the range of suffixes of T̄ which
start with p̄. That is, we store a range [i, j] such that any suffix of T̄ starts
with p̄ iff its rank k in lexicographic order relatively to all other suffixes of
T̄ is included in [i, j]. This is equivalent to saying that any prefix of T ends
with p iff its rank k in lexicographic order relatively to all other prefixes
of T (prefixes being compared through their reverses) is included in [i, j].

4. A table SA^{-1}[1..n]. This table stores for each suffix T[i..n], for all 1 ≤ i ≤ n,
the rank of the suffix T[i..n] in lexicographic order relatively to all other
suffixes of T.

5. A table PA^{-1}[1..n]. This table stores for each prefix T[1..i] the rank of
the prefix T[1..i] in lexicographic order relatively to all the other prefixes
of T.

6. A polynomial hash function H parametrized with a prime P > n^a and an
integer r (a seed). For a string x we have
H(x) = x[1]·r + x[2]·r^2 + ... + x[|x|]·r^{|x|}.
The details of the construction are described below. The hash function
essentially uses just O(log n) bits of space to store the numbers P and r.

7. A constant time weak prefix search data structure (which we denote by
W_0) built on the set U, the set of substrings (factors) of T of fixed length
b characters (to which we add a set of b - 1 artificial strings of length b).
Note that |U| ≤ n. This data structure, which is described in [3], comes in
two versions, one of which uses O(n(b^ε + log log σ)) bits of space for any
constant 0 < ε < 1 and answers queries in O(1) time. We denote that
query time by t_{W_0}. Details are described below.

8. Finally a prefix sum data structure V_0 which stores for every p ∈ U, sorted
in lexicographic order, the number of suffixes of T prefixed by p (for each
of the artificial strings this number is set to one). This prefix sum data
structure uses O(|U|) = O(n) bits of space and can be queried in constant
time.

Note that the total space usage is dominated by the indexing data structures
(SA, SA^{-1}, PA, the suffix and prefix trees) which occupy O(n log n) bits of space.
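
As a small illustration of components 1, 4 and 5, the sketch below builds SA, SA^{-1} and PA^{-1} naively by sorting. It is a didactic construction under our own assumptions (0-based positions, quadratic-time sorting of Python slices, function names of our choosing) and not the compressed representation used later in the paper.

```python
def suffix_array(text):
    """0-based start positions of the suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_array(text):
    """0-based end positions e of the prefixes text[:e+1], ordered by reversed prefix."""
    return sorted(range(len(text)), key=lambda e: text[:e + 1][::-1])


def inverse(perm):
    """Inverse permutation: inverse(perm)[pos] is the rank of pos in perm."""
    inv = [0] * len(perm)
    for rank, pos in enumerate(perm):
        inv[pos] = rank
    return inv


text = "banana"
SA = suffix_array(text)          # suffixes a, ana, anana, banana, na, nana
SA_inv = inverse(SA)             # rank of the suffix starting at each position
PA_inv = inverse(prefix_array(text))  # rank of the prefix ending at each position
```

Ordering prefixes by their reverses is what makes "all prefixes of T ending with p" a contiguous range of ranks, mirroring the role of the reverse suffix tree.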

Text indexing data structures. The only operation we need to do on the
prefix tree is, for a given pattern q, to determine for each prefix p of q the range
of all prefixes of T which have p as a suffix. Similarly for the suffix tree we only
need to know for each suffix s of q the range of suffixes which are prefixed by s.
The classical representations of our text indexing data structures SA, SA^{-1}, PA,
the suffix and prefix trees all occupy O(n log n) bits of space. However in our
case, for representing the text indexing data structures, we need to go below the
O(n log n) bits used by the classical representations. We thus will make use
of compressed representations of the text indexing data structures [11] [T^]. In
particular we only need the following results:

• For every prefix p_i = q[1, i] of q of length i (equivalently,
p̄_i = q̄[|q| - i + 1, |q|]), determine the range [pl_i, pr_i] of prefixes of T
which are suffixed by p_i. This can be accomplished incrementally in O(m) time
by following the suffix links in the prefix tree S̄. That is, deducing the range
corresponding to the prefix of q of length i from the range of the prefix of q
of length i + 1 in O(1) time (following a suffix link at each step takes O(1)
time). In the context of compressed data structures, this can be accomplished
using backward search on the compressed representation of the prefix array
PA [11], still in time O(m) and representing PA in O(n) bits only (assuming a
constant alphabet). In this case the range corresponding to the prefix of q of
length i + 1 is deduced from the range corresponding to the prefix of q of
length i.

• For every suffix s_i = q[|q| - i + 1, |q|] of q of length i, determine the range
[sl_i, sr_i] of suffixes of T which are prefixed by s_i. This can be done in a
similar way in total O(m) time by either following suffix links in a standard
representation of the suffix tree S or by backward search in a compressed
representation of the suffix array SA. The compressed representation
occupies O(n) bits only.

• For any i we need to have fast access to SA[i], SA^{-1}[i], PA[i], PA^{-1}[i].
In case those four tables are represented explicitly in O(n log n) bits of
space, the access time is trivially O(1). However in the context of compressed
representations, we need to use less than O(n log n) bits of space
and still be able to have fast access to the arrays.

The first two results can be summarized with the following lemma:

Lemma 1 [11] Given a text T of length n over a constant sized alphabet we can
build a data structure with O(n) bits of space such that given a pattern q of
length m:

• We can in O(m) time determine the range of suffixes of T prefixed by s_i
for all suffixes s_i of q of length i ∈ [1..m].

• We can in O(m) time determine the range of prefixes of T suffixed by p_i
for all prefixes p_i of q of length i ∈ [1..m].

If the alphabet is non constant, then we can obtain the same results using
O(n log n) bits of space.

The third needed text indexing result is summarized with the following lemma:

Lemma 2 [T^l Uni US] Assuming a constant alphabet size, we can compress
the arrays PA, SA, PA^{-1} and SA^{-1} with the following tradeoffs:

• A representation in O(n log log n) bits with access time t_SA = O(log log n).

• A representation in O(n log^ε n) bits with access time t_SA = O(1).

Weak prefix search data structure. A weak prefix search data structure
built on a set of strings U permits, given a prefix p of any element in U, to
return the range of elements of U prefixed by p in lexicographic order. If given
a string which is not a prefix of any element in U, it returns an arbitrary
range. This weak prefix search data structure needs to use a hash function
H and assumes that after preprocessing a query string p, the computation of
H(p[1, i]) for any i takes constant time. It also assumes that every H(p[1, i])
is distinct for all p ∈ U and all i ∈ [1, b]. In our case we will use the following
time/space tradeoffs described in [3]:

Lemma 3 Given a set of n strings of fixed length b each over an alphabet of
size σ, we can have a weak prefix search data structure with the following tradeoffs:

• Query time t_W = O(c) with a data structure which uses O(n(b^{1/c} log b +
log log σ)) bits of space² for any constant c.

• Query time t_W = O(log b) with a data structure which uses O(n(log b +
log log σ)) bits of space.

The query time assumes that the computation of H(p') for any prefix p' of p
takes O(1) time, where p is the query string and H is the hash function used
by the weak prefix search data structure.

² Actually the result in [3] states a space usage of O(n b^{1/c} log b) but assumes a
constant alphabet size. However, it is easy to see that the same data structure works
for arbitrary σ, in which case it uses O(n(b^{1/c} log b + log log σ)) bits of space.
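
The interface of a weak prefix search can be illustrated with a deliberately non-succinct stand-in: a dictionary from every prefix to its rank range. This is only a sketch of the query semantics under our own assumptions (0-based inclusive ranges, function names of our choosing); the structure of [3] stores hash-based information in far less space and, unlike this sketch, returns an arbitrary range for a string that prefixes no element.

```python
def build_weak_prefix_index(strings):
    """Map every prefix of every string to its (lo, hi) rank range in sorted order."""
    ordered = sorted(set(strings))
    ranges = {}
    for rank, s in enumerate(ordered):
        for k in range(1, len(s) + 1):
            # Strings sharing a prefix are contiguous in sorted order, so the
            # first rank seen is the low end and the current rank the high end.
            lo, _ = ranges.get(s[:k], (rank, rank))
            ranges[s[:k]] = (lo, rank)
    return ordered, ranges


def weak_prefix_search(ranges, p):
    """Range of elements prefixed by p, or None here if p prefixes no element."""
    return ranges.get(p)


ordered, ranges = build_weak_prefix_index(["ban", "ana", "nan", "ana", "na#", "a##"])
```

Because a real weak prefix search may answer with an arbitrary range for non-prefixes, every candidate range has to be validated afterwards, which is the role of the occurrence check described in the query section.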

In our case, the weak prefix search data structure will be built on the set
U, the set of factors of fixed length b (or b log σ bits) of T' = T#^{b-1} (that is,
the text T concatenated with b - 1 copies of the character #, where # is a special
character absent from the initial alphabet and which is lexicographically smaller
than all other characters in the alphabet). Note that the set U contains exactly
the set of factors of T of length b, to which we add b - 1 artificial strings which
are obtained by appending #^{b-i} to every suffix of T of length i < b.

Hash function. In the hash function H the parameter P is fixed but the seed
r is chosen randomly. The goal is to build a hash function H such that the
hash values of the substrings of T used by the weak prefix search are all
distinct. For a randomly chosen r this is the case with high probability. If it is
not the case we randomly choose a new r and repeat the construction until all
the needed substrings of T are mapped to distinct hash values.
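
The retry loop for the seed can be sketched as follows. This is a minimal illustration under stated assumptions: characters are mapped to integers with ord, the prime 2^61 - 1 is our own choice rather than the bound on P used in the paper, the padded strings of T' are ignored for brevity, and the names poly_hash and choose_seed are hypothetical.

```python
import random


def poly_hash(x, r, P):
    """H(x) = x[1]*r + x[2]*r^2 + ... + x[|x|]*r^|x|, reduced modulo P."""
    h, power = 0, 1
    for ch in x:
        power = power * r % P
        h = (h + ord(ch) * power) % P
    return h


def choose_seed(text, b, P, rng):
    """Redraw r until all distinct substrings of length at most b hash distinctly."""
    needed = {text[i:i + k]
              for i in range(len(text))
              for k in range(1, b + 1)
              if i + k <= len(text)}
    while True:
        r = rng.randrange(1, P)
        if len({poly_hash(s, r, P) for s in needed}) == len(needed):
            return r
```

With a prime much larger than the number of substrings, the first draw succeeds with high probability, so the loop is expected to run a constant number of times.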

Prefix-sum data structure. A prefix-sum data structure permits to
succinctly encode an array A[1..n] of integers of total sum D in space
n(2 + ⌈log(D/n)⌉) bits, so that the sum Σ_{1≤j≤i} A[j] for any i can be computed in
constant time. This can be obtained by combining fast indexed bitvector
implementations [13] [7] with Elias-Fano coding [9] [10].
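
To show how the set U and the prefix sums fit together, the sketch below builds U from the padded text, stores the per-element suffix counts, and converts a rank range over U into a rank range over the suffix array. It uses a plain cumulative array rather than a succinct Elias-Fano encoding; the 0-based half-open output convention and the function names are our own assumptions.

```python
from collections import Counter
from itertools import accumulate


def build_U_and_counts(text, b, pad="#"):
    """Sorted distinct length-b factors of text + pad*(b-1), with suffix counts."""
    padded = text + pad * (b - 1)       # pad must be absent from text and smallest
    counts = Counter(padded[i:i + b] for i in range(len(text)))
    U = sorted(counts)
    V0 = [counts[u] for u in U]         # artificial (padded) strings get count 1
    return U, V0


def suffix_rank_range(V0, l0, r0):
    """Map the inclusive range [l0, r0] over U to suffix ranks [l1, r1)."""
    cum = [0] + list(accumulate(V0))    # cum[t] = V0[0] + ... + V0[t-1]
    return cum[l0], cum[r0 + 1]


U, V0 = build_U_and_counts("banana", 3)
```

For "banana" and b = 3, the elements of U prefixed by "na" occupy ranks 3 to 4, which maps to suffix ranks 4 and 5, that is the suffixes "na" and "nana".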

3.2 Queries 

Preprocessing. To make a query on our full-text index for a string q of length
m, we first proceed with a preprocessing step which takes O(m) time. The
preprocessing consists of the following phases:

1. Compute the arrays L[0..|q|] and R[1..|q| + 1] by traversing the suffix tree
S for the string q and the prefix tree S̄ for the string q̄. Initially L[0] =
R[|q| + 1] = [1, n], and then L[i] stores the range of prefixes suffixed by
q[1..i] and R[i] stores the range of suffixes prefixed by q[i, m] (the range
[1, n] is naturally associated with the empty string, as the empty string
q[1..0] or q[m + 1..m] is a suffix and a prefix of any other string). This step
always takes O(m) time whatever suffix array implementation we use.

2. We precompute an array which stores all the values of r^i for all 0 ≤ i ≤ m.

3. We precompute all the values H(q[1, i]) for all 1 ≤ i ≤ m. That is, all the
hash values for all the prefixes of q. This can easily be done incrementally
as we have H(q[1, 1]) = q[1]·r and then H(q[1, i + 1]) = H(q[1, i]) +
q[i + 1]·r^{i+1} for all 1 ≤ i < m.

4. We precompute all the values H(q[m - i + 1, m]) for all 1 ≤ i ≤ m. That
is, all the hash values for all the suffixes of q. This can also easily be done
incrementally as we have H(q[m, m]) = q[m]·r and then H(q[i, m]) =
(H(q[i + 1, m]) + q[i])·r for all 1 ≤ i < m.
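
Phases 2 to 4 can be sketched in a few lines. The code is an illustration under our own assumptions: all arithmetic is done modulo P, characters are mapped to integers with ord, and the arrays are padded so that the paper's 1-based indices can be used directly.

```python
def poly_hash(x, r, P):
    """Reference hash: x[1]*r + x[2]*r^2 + ... + x[|x|]*r^|x| modulo P."""
    h, power = 0, 1
    for ch in x:
        power = power * r % P
        h = (h + ord(ch) * power) % P
    return h


def preprocess(q, r, P):
    """Return (pw, pre, suf) with pw[i] = r^i, pre[i] = H(q[1..i]), suf[i] = H(q[i..m])."""
    m = len(q)
    pw = [1] * (m + 2)
    for i in range(1, m + 2):
        pw[i] = pw[i - 1] * r % P
    pre = [0] * (m + 1)                      # pre[0] is the hash of the empty string
    for i in range(1, m + 1):
        pre[i] = (pre[i - 1] + ord(q[i - 1]) * pw[i]) % P
    suf = [0] * (m + 2)                      # suf[m + 1] is the hash of the empty string
    for i in range(m, 0, -1):
        suf[i] = (suf[i + 1] + ord(q[i - 1])) * r % P
    return pw, pre, suf
```

Each suffix hash is that of the suffix taken as a standalone string, which is what allows it to be shifted by a power of r when it is spliced after an edit.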

Hash function computation. We now describe some useful properties of the
hash function H which will be of interest for queries. An interesting property
of the hash function H is that after the precomputation phase, computing H(p)
for any p at edit distance 1 from q takes constant time:

1. Deletion at position i: computing the hash value of p = q[1, i - 1]q[i + 1, m]
(q[1, i] is defined as the empty string when i = 0) is done by the formula
H(p) = H(q[1, i - 1]) + H(q[i + 1, m])·r^{i-1} (H(q[1, i - 1]) = 0 if i = 1).

2. Substitution at position i: computing the hash value of p = q[1, i - 1]cq[i + 1, m]
is done by the formula H(p) = H(q[1, i - 1]) + (c + H(q[i + 1, m]))·r^i.

3. Insertion at position i: computing the hash value of p = q[1, i - 1]cq[i, m]
is done by the formula H(p) = H(q[1, i - 1]) + (c + H(q[i, m]))·r^i.

Moreover, computing H(p') for any prefix p' of a string p at edit distance 1 from
q also takes constant time:

• The hash value for a prefix p' of length j of a string p obtained by deletion
at position i in q can be obtained by H(p') = H(p) - H(q[j + 2, m])·r^j
(H(q[j + 2, m]) = 0 if j + 2 > m) whenever j ≥ i, or H(p') = H(q[1, j])
otherwise.

• The hash value for a prefix p' of length j of a string p obtained by substitution
at position i in q can be obtained by H(p') = H(p) - H(q[j + 1, m])·r^j
(H(q[j + 1, m]) = 0 if j + 1 > m) whenever j ≥ i, or H(p') = H(q[1, j])
otherwise.

• The hash value for a prefix p' of length j of a string p obtained by insertion
at position i in q can be obtained by H(p') = H(p) - H(q[j, m])·r^j
(H(q[j, m]) = 0 if j > m) whenever j ≥ i, or H(p') = H(q[1, j]) otherwise.
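
The constant-time formulas above can be checked mechanically against a direct evaluation of H. The sketch below is our own illustration: the class name EditHasher is hypothetical, arithmetic is done modulo P, characters are mapped with ord, and positions i and lengths j are 1-based as in the text. In each case the part of q that follows the prefix is a standalone suffix hash shifted by r^j.

```python
def poly_hash(x, r, P):
    """Reference hash: x[1]*r + ... + x[|x|]*r^|x| modulo P."""
    h, power = 0, 1
    for ch in x:
        power = power * r % P
        h = (h + ord(ch) * power) % P
    return h


class EditHasher:
    """O(1) hashes of strings at edit distance 1 from q, after O(m) preprocessing."""

    def __init__(self, q, r, P):
        m, self.P = len(q), P
        self.pw = [1] * (m + 2)                  # pw[i] = r^i
        for i in range(1, m + 2):
            self.pw[i] = self.pw[i - 1] * r % P
        self.pre = [0] * (m + 1)                 # pre[i] = H(q[1..i])
        for i in range(1, m + 1):
            self.pre[i] = (self.pre[i - 1] + ord(q[i - 1]) * self.pw[i]) % P
        self.suf = [0] * (m + 2)                 # suf[i] = H(q[i..m]), suf[m+1] = 0
        for i in range(m, 0, -1):
            self.suf[i] = (self.suf[i + 1] + ord(q[i - 1])) * r % P

    def _prefix(self, full, i, j, offset):
        """Hash of the length-j prefix; q[j+offset..m] is the part cut off."""
        if j is None:
            return full
        if j < i:
            return self.pre[j]
        return (full - self.suf[j + offset] * self.pw[j]) % self.P

    def deletion(self, i, j=None):
        full = (self.pre[i - 1] + self.suf[i + 1] * self.pw[i - 1]) % self.P
        return self._prefix(full, i, j, 2)

    def substitution(self, i, c, j=None):
        full = (self.pre[i - 1] + (ord(c) + self.suf[i + 1]) * self.pw[i]) % self.P
        return self._prefix(full, i, j, 1)

    def insertion(self, i, c, j=None):
        full = (self.pre[i - 1] + (ord(c) + self.suf[i]) * self.pw[i]) % self.P
        return self._prefix(full, i, j, 0)
```

Every method performs a fixed number of table lookups and modular operations, which is the constant-time property the weak prefix search relies on.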

Checking occurrences. Suppose that we have found a potential occurrence
of a matching substring of the text obtained by one deletion, one insertion
or one substitution. There exists a standard way to check for the validity of
the matching (this has been used several times before) using the arrays PA^{-1}
and SA^{-1}. Suppose that we have located a potential occurrence of a string p
obtainable by deletion of the character at position i in the query string q. In this
case we have p = q[1, i - 1]q[i + 1, m]. Moreover, suppose that we have found
for p a potentially matching location j in the text. Then checking whether this
matching location is correct is a matter of just checking that PA^{-1}[j + i - 2] ∈
L[i - 1] and that SA^{-1}[j + i - 1] ∈ R[i + 1]. Indeed, checking whether
T[j..j + m - 2] = p = q[1, i - 1]q[i + 1, m] is just a matter of checking that
T[j..j + i - 2] = q[1..i - 1] and that T[j + i - 1..j + m - 2] = q[i + 1, m], which amounts
to checking that the prefix of T ending at T[j + i - 2] is suffixed by q[1, i - 1] and
that the suffix starting at T[j + i - 1] is prefixed by q[i + 1, m]. Those
two conditions are equivalent to checking that PA^{-1}[j + i - 2] ∈ L[i - 1] and
that SA^{-1}[j + i - 1] ∈ R[i + 1] respectively. Checking a matching location for
an insertion or a substitution can be done similarly. For checking a matching
at position j in the text of a pattern obtainable by insertion of a character c at
position i, it suffices to check that PA^{-1}[j + i - 2] ∈ L[i - 1], that T[j + i - 1] = c
and finally that SA^{-1}[j + i] ∈ R[i]. Similarly, checking for a matching of a string
obtainable by a substitution can be done by checking that PA^{-1}[j + i - 2] ∈
L[i - 1], that T[j + i - 1] = c and finally that SA^{-1}[j + i] ∈ R[i + 1]. We thus
have the following lemma which will be used as a central component for query
implementation:

Lemma 4 Given any pattern q for which the arrays L and R have been
precomputed, we can for any string p at distance 1 from q (where the string p is
described with the O(1) words of information needed to describe the edit operation
which transforms q into p) and a location l check whether p occurs at location l
in the text by probing the text and the arrays SA^{-1} and PA^{-1} a constant number
of times, and thus in time O(t_SA).
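
The check of Lemma 4 can be sketched with naively built tables. The code below is an illustration under our own assumptions: positions are 0-based (s is the candidate start in the text, i the edited position in q), L[k] and R[k] are the rank intervals for q[:k] and q[k:] computed by brute force instead of by tree traversal, and the class name Verifier is hypothetical.

```python
class Verifier:
    """Constant-probe occurrence checks using PA^{-1}, SA^{-1}, L and R."""

    def __init__(self, text, q):
        self.text, self.n, self.m = text, len(text), len(q)
        n, m = self.n, self.m
        sa = sorted(range(n), key=lambda s: text[s:])
        pa = sorted(range(n), key=lambda e: text[:e + 1][::-1])
        self.sa_inv, self.pa_inv = [0] * n, [0] * n
        for rank, s in enumerate(sa):
            self.sa_inv[s] = rank
        for rank, e in enumerate(pa):
            self.pa_inv[e] = rank
        span = lambda ranks: (min(ranks), max(ranks)) if ranks else None
        # L[k]: ranks of prefixes ending with q[:k]; R[k]: ranks of suffixes starting with q[k:].
        self.L = [span([self.pa_inv[e] for e in range(n) if text[:e + 1].endswith(q[:k])])
                  for k in range(m + 1)]
        self.R = [span([self.sa_inv[s] for s in range(n) if text.startswith(q[k:], s)])
                  for k in range(m + 1)]

    def _prefix_ok(self, end, k):
        """True iff text[:end] ends with q[:k] (one probe of PA^{-1})."""
        if k == 0:
            return True
        if end < 1 or end > self.n or self.L[k] is None:
            return False
        lo, hi = self.L[k]
        return lo <= self.pa_inv[end - 1] <= hi

    def _suffix_ok(self, start, k):
        """True iff text[start:] starts with q[k:] (one probe of SA^{-1})."""
        if k == self.m:
            return start <= self.n
        if start >= self.n or self.R[k] is None:
            return False
        lo, hi = self.R[k]
        return lo <= self.sa_inv[start] <= hi

    def _char_ok(self, pos, c):
        return pos < self.n and self.text[pos] == c

    def deletion(self, s, i):
        return self._prefix_ok(s + i, i) and self._suffix_ok(s + i, i + 1)

    def substitution(self, s, i, c):
        return (self._prefix_ok(s + i, i) and self._char_ok(s + i, c)
                and self._suffix_ok(s + i + 1, i + 1))

    def insertion(self, s, i, c):
        return (self._prefix_ok(s + i, i) and self._char_ok(s + i, c)
                and self._suffix_ok(s + i + 1, i))
```

Only the three check methods reflect the lemma; the constructor is a slow stand-in for the preprocessing that fills L and R in O(m) time.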

Query algorithm. We now describe how queries for a given string q of length
m are implemented. Recall that we are dealing with an alphabet of fixed size.
That means that the number of strings at distance 1 from a given q is O(mσ) =
O(m). Our algorithm will simply check exhaustively for a match in the text
of every string p which can be obtained by one insertion, one deletion or one
substitution in the string q. Each time we check a string p we also report
the locations of all its occurrences. The matching for a given
string p proceeds in the following way:

• Do a weak prefix search on W_0 for the string p, which takes either constant
time or O(log b) time depending on the implementation used. The result
of this weak prefix search is a range [l_0, r_0] of elements in U which are
potentially prefixed by p.

• Using the prefix-sum data structure V_0, compute the range [l_1, r_1] of
suffixes of T potentially prefixed by p. This range is given by l_1 =
Σ_{1≤t<l_0} V_0[t] and r_1 = Σ_{1≤t≤r_0} V_0[t]; its computation takes constant
time.

• Do a lookup for j = SA[r_1] in the suffix array. This takes either time
O(log log n) or O(1) depending on the suffix array implementation.

• Finally check that there is a match at location j in the text (this is
done differently depending on whether we are dealing with an insertion,
a substitution or a deletion). This check, which is done with the help
of Lemma 4, needs one access to PA^{-1} and one access to SA^{-1} and
thus takes either O(log log n) or O(1) time depending on the
implementation. If the match is correct we report the position j and
additionally report all the remaining matching locations, which are at positions
SA[l_1 + 1], SA[l_1 + 2], ..., SA[r_1 - 1] (by querying the compressed
suffix array using Lemma 2). Otherwise we return an empty set.
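
The four steps above can be put together in a small end-to-end sketch. It follows the same control flow (prefix range over U, prefix sums, one suffix array lookup, one validation, then reporting the whole range) but replaces every succinct component by a naive stand-in: an exact dictionary instead of the hash-based weak prefix search, a cumulative array instead of V_0, and a direct string comparison instead of Lemma 4. All names, the 0-based conventions and the explicit alphabet argument are our own assumptions.

```python
from collections import Counter
from itertools import accumulate


def build_index(text, b, pad="#"):
    """Naive stand-ins for SA, the prefix sums over V_0 and the weak prefix search."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    padded = text + pad * (b - 1)
    counts = Counter(padded[i:i + b] for i in range(n))
    U = sorted(counts)
    cum = [0] + list(accumulate(counts[u] for u in U))
    ranges = {}
    for rank, u in enumerate(U):
        for k in range(1, b + 1):
            lo, _ = ranges.get(u[:k], (rank, rank))
            ranges[u[:k]] = (lo, rank)
    return sa, cum, ranges


def neighbours(q, alphabet):
    """All non-empty strings obtained from q by at most one edit operation."""
    out = set()
    for i in range(len(q)):
        out.add(q[:i] + q[i + 1:])                           # deletion
        out.update(q[:i] + c + q[i + 1:] for c in alphabet)  # substitution
    for i in range(len(q) + 1):
        out.update(q[:i] + c + q[i:] for c in alphabet)      # insertion
    out.discard("")
    return out


def query(text, index, q, alphabet, b):
    """Report (start, length) of substrings of text within edit distance 1 of q."""
    sa, cum, ranges = index
    hits = set()
    for p in neighbours(q, alphabet):
        if len(p) > b or p not in ranges:
            continue
        l0, r0 = ranges[p]                  # step 1: range over U
        l1, r1 = cum[l0], cum[r0 + 1]       # step 2: suffix ranks l1 .. r1 - 1
        j = sa[r1 - 1]                      # step 3: one suffix array lookup
        if text.startswith(p, j):           # step 4: validate a single candidate
            hits.update((s, len(p)) for s in sa[l1:r1])
    return hits
```

In this sketch the validation is redundant because the dictionary is exact; with a true weak prefix search it is what filters out the arbitrary ranges returned for strings that do not occur in the text.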

For proving the correctness of the query, we first prove the following lemma:

Lemma 5 Given a string p of length at most b such that p is a prefix of at
least one suffix of T, with the help of W_0 and V_0 we can find the interval of
suffixes prefixed by p in time O(t_{W_0}). Further, given any p we can check whether
it prefixes some suffix of T in time O(t_{W_0} + t_SA) and if not return an empty set.

Proof. We start with the first assertion. If $p$ is a prefix of some suffix $s$ of $T$, then it
is also a prefix of some element of $U$. This is trivially the case if $|s| \ge b$,
and it is also the case if $|s| < b$, as we store in $U$ an artificial element for $s$
which is necessarily prefixed by $p$ as well. Now we prove that the returned interval
$[l_1, r_1]$ is the right interval of suffixes prefixed by $p$. First notice that the weak
prefix search by definition returns the interval $[l_0, r_0]$ of elements of $U$ which are
prefixed by $p$. Now, we can easily prove that $l_1 = \sum_{1 \le t < l_0} V_0[t]$ is exactly the
number of suffixes which are lexicographically smaller than $p$. This is the case
as we know that the sum $\sum_{1 \le t < l_0} V_0[t]$ counts exactly the following:

1. All suffixes of length less than $b$ which are lexicographically smaller than
$p$ and for which an artificial element was inserted in $U$.

2. All the suffixes of length at least $b$ whose prefixes of length $b$ are
lexicographically smaller than $p$.

On the other hand, we can prove that $r_1 - l_1$ gives exactly the number of suffixes
prefixed by $p$. That is, $r_1 - l_1 = \sum_{l_0 \le t \le r_0} V_0[t]$, which gives the number of suffixes
of length at least $b$ prefixed by elements of $U$ prefixed by $p$, in addition to suffixes
of length less than $b$ prefixed by $p$ and for which a corresponding artificial element
has been stored in $U$. Now that the first assertion of the lemma has been
proved, we turn our attention to the second assertion. This second assertion is
immediate: given a string $p$ which does not prefix any suffix, we know by lemma 4
that the check will fail for any suffix of $T$ and thus fail for the suffix at
position $j$ in the last step, which thus returns an empty set. ∎
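
As an illustration of the interval computation used in this proof, the following minimal Python sketch builds $U$ and $V_0$ for a small text and maps an interval of $U$ to an interval of the suffix array through prefix sums. It is only a sketch under stated assumptions: the weak prefix search $W_0$ is replaced by an exact binary search on the sorted list $U$ (so no verification step is needed), the suffix array is built naively, positions are 0-based, and the names `PAD`, `build` and `suffixes_prefixed_by` are ours and do not come from the paper.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical padding symbol, assumed smaller than every text character.
PAD = "\x00"


def build(text, b):
    """Return (sa, U, prefix) for the given text and prefix length b."""
    # Naive suffix array, good enough for a small illustration.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    U, V0 = [], []
    for i in sa:
        # Suffixes shorter than b are padded: these are the artificial elements.
        key = text[i:i + b].ljust(b, PAD)
        if U and U[-1] == key:
            V0[-1] += 1          # one more suffix shares this length-b prefix
        else:
            U.append(key)
            V0.append(1)
    # prefix[t] = V0[0] + ... + V0[t-1]
    prefix = [0] + list(accumulate(V0))
    return sa, U, prefix


def suffixes_prefixed_by(p, sa, U, prefix):
    """Text positions of the suffixes prefixed by p (requires len(p) <= b)."""
    # Interval [lo, hi) of elements of U prefixed by p (exact search here,
    # standing in for the weak prefix search of the paper).
    lo = bisect_left(U, p)
    hi = lo
    while hi < len(U) and U[hi].startswith(p):
        hi += 1
    # Prefix sums translate the interval of U into an interval of the suffix array.
    l1, r1 = prefix[lo], prefix[hi]
    return sa[l1:r1]
```

For the text `banana` with `b = 3`, the call `suffixes_prefixed_by("an", *build("banana", 3))` returns the two positions at which `an` occurs, in suffix array order.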

The following lemma summarizes the query for a prefix p at distance one 
from q: 

Lemma 6 Given any pattern $q$ for which the arrays $L$ and $R$ have been
computed, we can for any string $p$ at distance 1 from $q$ (where the string $p$ is
described with $O(1)$ words of information needed to describe the edit operation
which transforms $q$ into $p$) search for all the $occ$ suffixes prefixed by $p$ in time
$O(t_{W_0} + (occ + 1)\,t_{SA})$.

Proof. We first prove the correctness of the operations as described above and
then prove the time bound. For that we examine the two possibilities:

• $p$ is not a prefix of any suffix of $T$, in which case the query should return
an empty set. It is easy to prove the equivalent implication: if the data
structure returns a non-empty set, then there exists at least one suffix
of $T$ prefixed by $p$. For the data structure to return a non-empty set, the
check using lemma 4 must return true for the location $j$ in the text,
and for this check to be true $p$ must be a prefix of the suffix starting at
position $j$ in the text.

• $p$ is a prefix of some suffix of $T$, in which case the query must return all those
suffixes. Note that by definition the weak prefix search $W_0$ will return the
right range of elements of $U$ which are prefixed by $p$. The justification
for this is that a single match implies that there exists at least one suffix
prefixed by $p$, which implies that the weak prefix search $W_0$ returns the correct
range of factors in $U$, and thus the bit-vector $V_0$ returns the correct range
of suffixes $SA[l_1], SA[l_1+1], SA[l_1+2], \ldots, SA[r_1-1], SA[r_1]$, which must
also have $p$ as a prefix.

We now prove the time bound. In the case that $p$ is not a prefix of any suffix of
$T$, the query time is clearly $O(t_{W_0} + t_{SA})$, as we are doing one query to the weak
prefix search $W_0$ (which takes $O(t_{W_0})$ time), one query to the prefix-sum data
structure $V_0$ in constant time and finally the check using lemma 4, which
takes $O(t_{SA})$ time. In case $p$ is a prefix of some suffixes, the first step, which
checks that the set is non-empty, takes the same time as above, and reporting each
occurrence takes additional $t_{SA}$ time per occurrence. ∎

3.3 Solution for arbitrary pattern length 

We use the following lemma proved in [5, section 3.2].

Lemma 7 For any text $T$ of length $n$ characters over an alphabet of constant
size we can build an index of size $O(n)$ bits so that we can report all of the $occ$
substrings of the text which are at edit distance 1 from any pattern $q$ of length
$m \ge \log^{c} n \log\log n$ (for some constant $c$) in time $O(m + occ)$.

The solution for theorem 3 is easily obtained by combining theorem 1 with
lemma 7 in the following way: we first build the index of [5], whose query time
is upper bounded by $O(m + occ)$ whenever $m \ge \log^{c} n \log\log n$ for some constant $c$
and whose space usage is $O(n)$ bits. Then we build the data structure of
theorem 1 in which we set $b = \log^{c} n \log\log n$ and $\varepsilon = \delta/5$, where $\delta$ is any
constant which satisfies $0 < \delta < 1$. In the case we have a string of length less
than $b$, we use the index of theorem 1 to answer the query in time $O(m + occ)$ when
using the first variant or in time $O((m + occ)\log\log n)$ when using the second
variant. In the case where we have a string of length at least $b$,
we use the index of [5], answering queries in time
$O(m + \log^{c} n \log\log n + occ) = O(m + occ)$. The space is thus dominated by our
index, which uses either $O(n(\log^{c} n \log\log n)^{\varepsilon}) = O(n\log^{\delta} n)$ or
$O(n\log(\log^{c} n \log\log n)) = O(n\log\log n)$ bits of space.

4 Data structure for large alphabets 

The time bound of theorem 1 has a linear dependence on the alphabet
size, as the query time is actually $O(\sigma m + occ)$. This query time is not reasonable
when $\sigma$ is non-constant. In this section we show a solution which has no
dependence on the alphabet size. In order to get this solution we combine the
result of [5] with an improved version of [6]. This is shown in section 4.5.
Before that we first show how we improve the solution in [6], but only for query
strings of length bounded by a parameter $b$. This is proved in sections 4.2, 4.3
and 4.4. This improvement is an extension of the data structures described in
the previous section, but with two major differences: it does not use compressed
variants of the text indexing data structures and its query time is independent
of the alphabet size. Before describing the details of the data structures used,
we first recall in section 4.1 a few definitions and data structures which will be
used in our construction.

4.1 Tools 

For proving our results for arbitrary alphabet sizes we will make use of the 
following additional tools: 

Centroid path decomposition A centroid path decomposition of a tree is a
special decomposition of a tree of $n$ nodes into a set of disjoint paths (sequences
of consecutive parent-child edges) called centroid paths. The main property of
this decomposition is that any root-to-leaf path in the original tree intersects at
most $O(\log n)$ centroid paths. A centroid path decomposition relies on the heavy
child notion, where the only heavy child of a node is defined as the child with
the largest subtree size among all the children of that node (when more than one child
shares the largest subtree size, we choose any one of them). All other children
of a node are defined as light children. A heavy edge is an edge connecting a
parent to its heavy child. A light edge is an edge connecting a node to a light
child. The centroid paths are built in the following way: an initial centroid path
is constituted by the only root-to-leaf path which consists only of heavy edges.
Each of the other centroid paths consists of a single initial light edge followed by
a maximal sequence of heavy edges which terminates at a leaf of the tree. In any
centroid path, the labels of the heavy edges are called branching characters. We
say that a centroid path $y$ hangs from a centroid path $x$ at a node $u_x$ if and only
if $u_x$ is a node of $x$ which has a (light) child $u_y$ in $y$ (the light
edge connecting $u_x$ to $u_y$ is the first edge of the centroid path $y$).

Centroid path traversal For a given pattern of length $m$, the traversal of a
suffix tree decomposed according to a centroid path decomposition will traverse
at most $t = \min(m, \log n)$ centroid paths. We denote those traversed centroid
paths by $C_1, C_2, \ldots, C_t$, where each centroid path $C_i$ hangs from the centroid
path $C_{i-1}$.
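
To make the decomposition concrete, the following minimal Python sketch computes the centroid paths of a small rooted tree. It is an illustration under assumptions of ours: the tree is given as a dictionary `children` mapping each node to the list of its children, ties between equally large subtrees are broken in favour of the first child, and every non-initial path is returned as a list of nodes that starts with the node it hangs from, so that its first edge is the light edge.

```python
def centroid_paths(children, root):
    """Decompose a rooted tree into centroid paths (lists of nodes)."""
    # Visit order in which every parent precedes its children.
    order = [root]
    for u in order:
        order.extend(children.get(u, []))
    # Subtree sizes, computed children first.
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children.get(u, []))

    def heavy_child(u):
        kids = children.get(u, [])
        return max(kids, key=lambda c: size[c]) if kids else None

    paths = []
    # Each pending entry is (node the path hangs from, first node below it).
    pending = [(None, root)]
    while pending:
        parent, u = pending.pop()
        path = [u] if parent is None else [parent, u]
        h = heavy_child(u)
        while h is not None:
            # Every light child starts a new centroid path hanging at u.
            for c in children[u]:
                if c != h:
                    pending.append((u, c))
            path.append(h)       # follow the heavy edge
            u, h = h, heavy_child(h)
        paths.append(path)
    return paths
```

On the tree `{0: [1, 2], 1: [3, 4], 4: [5]}` rooted at `0`, the initial centroid path is `[0, 1, 4, 5]` and two further paths hang from it, `[0, 2]` at node `0` and `[1, 3]` at node `1`. Every edge of the tree belongs to exactly one path.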

1D range color reporting data structure A 1D range color reporting data
structure solves the following problem: given an array $A[1..n]$ of colors, each
chosen from the same alphabet of size $\sigma$, we want to answer the following
queries: given an interval $[i..j]$, return all the $occ$ distinct colors which occur in
the array elements $A[i], A[i+1], \ldots, A[j]$. The recent solution devised in [14] uses
optimal space $O(n\log\sigma)$ and answers queries in optimal $O(occ)$
time. Moreover, that solution can also report the distinct colors in a given
interval one by one, in $O(1)$ time per color.

Lemma 8 [14] Given an array $A[1..n]$ of colors from the set $\{1, 2, \ldots, \sigma\}$, we
can build a data structure of size $O(n\log\sigma)$ so that given any range $[i..j]$ we
can return all the $occ$ distinct colors which appear in $A[i..j]$ in time $O(occ)$.
Moreover, the colors can be returned one by one in $O(1)$ time per color.
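
The interface of lemma 8 can be illustrated with a short Python sketch. We stress that this is not the structure of [14]: it uses the classical technique of storing, for every position, the previous position holding the same color, and answering range minimum queries over those values with a sparse table, which costs $O(n\log n)$ words rather than the space stated in the lemma. The class name `DistinctColors` and its methods are ours, and positions are 0-based.

```python
class DistinctColors:
    """Report each distinct color of a range once (classical prev/RMQ method)."""

    def __init__(self, colors):
        self.colors = colors
        n = len(colors)
        # prev[k] = previous position holding the same color as position k, or -1.
        last, self.prev = {}, []
        for k, c in enumerate(colors):
            self.prev.append(last.get(c, -1))
            last[c] = k
        # Sparse table: table[L][k] = position of the minimum of prev[k .. k + 2^L - 1].
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            below = self.table[-1]
            row = []
            for k in range(n - 2 * span + 1):
                a, b = below[k], below[k + span]
                row.append(a if self.prev[a] <= self.prev[b] else b)
            self.table.append(row)
            span *= 2

    def _argmin(self, i, j):
        level = (j - i + 1).bit_length() - 1
        a = self.table[level][i]
        b = self.table[level][j - (1 << level) + 1]
        return a if self.prev[a] <= self.prev[b] else b

    def report(self, i, j):
        """Yield every distinct color of colors[i..j] (inclusive) exactly once."""
        stack = [(i, j)]
        while stack:
            lo, hi = stack.pop()
            if lo > hi:
                continue
            k = self._argmin(lo, hi)
            if self.prev[k] >= i:
                continue  # no position of [lo, hi] is a first occurrence in [i, j]
            yield self.colors[k]
            stack.append((lo, k - 1))
            stack.append((k + 1, hi))
```

Since `report` is a generator, the colors are produced one by one, and a caller that only needs the first color (as in the preliminary phase of the query in section 4.4) can stop after a single step.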

4.2 High level description 

The solution of theorem 1 has too strong an alphabet dependence in its query
time. In particular, testing all insertion and substitution candidate strings takes
$O(m\sigma)$ time, as we have $O(m\sigma)$ candidates and spend $O(1)$ time testing
every candidate. By contrast, there are at most $m$ deletion candidates, which can thus
be tested in $O(m)$ time. We now give a high level description of our solution.
Our solution is based on a slightly simplified version of the solution described
in [6], which itself is a simplification of the solution initially described in [8]. The
main idea for doing approximate matching for insertions and substitutions is to
build correction trees (deletion and substitution trees) for each centroid path.
In total there will be $O(n)$ trees, which store $O(n\log n)$ elements in total
(see [5] for the details).
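
The candidate counts quoted at the start of this section can be checked with a small Python sketch. It enumerates every single edit operation on a pattern $q$, each described by $O(1)$ words as in lemma 6; the function names `candidates` and `apply_edit` are ours, positions are 0-based, and the edits are expressed as operations on $q$ itself.

```python
def candidates(q, alphabet):
    """Yield every single edit on q as a triple (operation, position, character)."""
    for pos in range(len(q)):
        yield ("del", pos, None)                 # m deletion candidates
    for pos, old in enumerate(q):
        for ch in alphabet:
            if ch != old:
                yield ("sub", pos, ch)           # about m * sigma substitutions
    for pos in range(len(q) + 1):
        for ch in alphabet:
            yield ("ins", pos, ch)               # (m + 1) * sigma insertions


def apply_edit(q, edit):
    """Return the string obtained by applying one edit triple to q."""
    op, pos, ch = edit
    if op == "del":
        return q[:pos] + q[pos + 1:]
    if op == "sub":
        return q[:pos] + ch + q[pos + 1:]
    return q[:pos] + ch + q[pos:]
```

For `q = "ab"` over the alphabet `"abc"` this gives 2 deletions, 4 substitutions and 9 insertions. Several edits may produce the same string, so the number of distinct candidate strings can be smaller than the number of edit descriptions.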

Type-1 substitution tree A type-1 substitution tree is built in the following
way. Consider all the centroid paths which hang from a given centroid path $C$.
Consider the set of suffixes which are stored in those centroid paths. Each such
suffix will be modified by substitution at exactly one position in the suffix and
then stored in the type-1 substitution tree. In more detail, consider the edges
of a centroid path $C$ in top to bottom order, with nodes $n_1, n_2, \ldots, n_t$. Each
edge $(n_i, n_{i+1})$ for $1 < i < t$ is a heavy edge^1 labeled with some character $c_i$.
Additionally, for each node $n_i$ for $1 < i < t$ there will be one or more centroid
paths hanging from $C$ at the node $n_i$.^2 The substitution tree is built in the
following way: for each light child $n'$ of a node $n_i$, where the edge $(n_i, n')$ is
labeled with the character $c'$, we store all the suffixes stored in the subtree of $n'$,
but in which the character $c'$ is replaced with the character $c_i$.

^1 Note that for the initial centroid path, $(n_1, n_2)$ is also a heavy edge.

^2 The same holds for $n_1$ in case $C$ is the initial centroid path (actually $n_1$ in this case is the
root of the tree).

Type-1 deletion tree A type-1 deletion tree is built in a similar way to
the type-1 substitution tree. For a given light child $n'$ of a node $n_i$, where $(n_i, n')$
is labeled with the character $c'$, we store only the modified suffixes which are
obtained from the suffixes stored in the subtree of $n'$ in which the character $c_i$
immediately follows the character $c'$. This time the modification consists in just
removing the character $c'$ from the suffixes instead of replacing the character $c'$
with the branching character $c_i$.

Type-2 substitution tree A type-2 substitution tree is built exactly as
the type-1 substitution tree, except that a modified suffix is built by substituting
a wildcard character $\alpha$ (a new character added to the alphabet)
instead of the branching character $c_i$.

4.3 Data structure Implementation 

We use the data structures of theorem 1, but with non-compressed variants
of the text indexing data structures, and in addition implement a general
strategy for searching in the substitution and deletion trees. This search is only
useful when used for patterns of length at most $b$. As a consequence, before
building our data structures we shrink the deletion and substitution trees by
retaining only the suffixes modified at one of the first $b$ positions. A correction
tree of size $n'$ is implemented through the use of the following data structures:

• We build a weak prefix search data structure with constant query time on $U'$, the
set of all the prefixes of length $b$ of the modified suffixes stored in the
correction tree. This data structure uses $O(n' \log n' (b^{\varepsilon} + \log\log\sigma))$ bits
of space. Similarly to theorem 1, modified suffixes of length $i < b$ are
padded and stored in the weak prefix search data structure
as artificial elements.

• We store in a prefix-sum data structure the number of modified suffixes 
prefixed by each element in U' (once again the entry of an artificial element 
is considered as a one). The entries in the prefix-sum data structure follow 
the lexicographic order of the prefixes. This prefix-sum data structure will
use $O(n')$ bits of space.

• We finally store a vector $R$ of $O(n')$ elements in which, for each modified
suffix in lexicographic order, we store the modification of the initial suffix
which was used to obtain this modified suffix. The encoding of the
modification differs according to the type of the tree we are encoding:

- In case of a type-1 deletion tree, we store a pair which consists of the
deleted character $c'$ plus the position of deletion. This takes $\log\sigma$
bits and $\log b$ bits respectively.

- In case of a type-1 substitution tree, we store a pair which consists
of the substituted character $c'$ plus the position of substitution. This
also takes $\log\sigma$ bits and $\log b$ bits respectively.

- In case of a type-2 substitution tree, we only store the substituted
character $c'$ using only $\log\sigma$ bits. The position of substitution is
not necessary in this case as it is known to be at the position of the
wildcard character $\alpha$ in the modified suffix.

This vector will thus occupy $O(n'(\log b + \log\sigma))$ bits. Additionally, we
augment the vector with the 1D range color reporting data structure of
lemma 8. This data structure will be able to return all the $d$
distinct elements in any range $R[l..r]$, for any $l$ and any $r$, in time $O(1)$ per
element. This data structure does not augment the space usage by more
than a constant factor.

Summing up the space used by all the components of a correction tree we get
a total space usage of $O(n'(b^{\varepsilon} + \log\sigma))$. Summing up over all the correction trees
we get a total of $O(n\log n(b^{\varepsilon} + \log\sigma))$ bits of space usage. An important point to
emphasize is that we use the same hash function $H$ for building all the weak
prefix search data structures of all the correction trees. In this context, the value
$P > n^{2}\sigma$ used for computing the hash function is large enough to ensure that
with high probability all the hash values used in any of the weak prefix search
data structures will be distinct (recall that we are using one weak prefix search
data structure per correction tree and we have three correction trees per centroid
path, which implies that we are storing in total $O(n)$ weak prefix search data structures).

4.4 Query algorithm 

The query algorithm works for exact matches and matches with one deletion,
one insertion or one substitution. Exact matching is trivial. Deletion matching
works exactly in the same way as in theorem 1. We thus mostly concentrate on
the matching for substitutions and insertions. Matching for substitutions is done
with the help of the type-1 and type-2 substitution trees, while the matching
for insertions is done with the help of the type-1 deletion tree. In more detail,
recall that when matching a pattern we encounter $t \le \log n$ centroid
paths named $C_1, \ldots, C_t$. Notice that the mismatch with the centroid path $C_i$ for
$i < t$ always happens at a branching character $c_i$, for which the corresponding
pattern character is $b_i \ne c_i$. For the last step $C_t$ we do not necessarily have a
mismatch with $q$, and even if we have one it does not necessarily happen at a
branching character. In this case we also denote the two mismatching characters
in the pattern and in the centroid path by $b_t$ and $c_t$ respectively. The strategy we use is
the same as the one used in [8] and [6], except that the implementation differs.
More precisely, for matching substitutions we do the following for each centroid
path $C_i$ for each $i \le t$:

• We match all the suffixes obtained by substitution above the character $c_i$
in the centroid path. Note that this substitution can only happen at a
branching character above the character $c_i$.^3 The suffixes which could
potentially match above character $c_i$ (more precisely, suffixes which could be
obtained by substitution at branching characters before character $c_i$) are thus
included in the set of suffixes stored in the centroid paths hanging above
$c_i$. For matching those suffixes it suffices to query the type-1 substitution
tree for the pattern $q$. Details on the implementation are below.

• We match all the suffixes obtained by substitution at the branching
character $c_i$. For that, we use the type-2 substitution tree; however, this time
the query is done on the pattern $q$ in which the character $b_i$ is replaced
with the character $\alpha$.

For the last centroid path $C_t$ the second step may differ. In fact, at the last step
the mismatch could well happen at a non-branching character, or there
could even be no mismatch at all. In the latter case we do not do the second step
at all. In the former case the second step consists in just querying directly the
suffix tree for the pattern $q$ in which the mismatching character $b_t$ is changed
to the character $c_t$ found in the centroid path (this is done using lemma 6).
Otherwise, if the mismatch happens at a branching character, then we just do
the second step as it was done for the other centroid paths.
For matching insertions a similar strategy is used:

• We match all the suffixes obtained by insertion above the character $c_i$ in
the centroid path. Once again, an insertion can only happen at a branching
character above the character $c_i$.^4 The suffixes which could potentially
match with an insertion above character $c_i$ (more precisely, insertions right
before the branching characters above character $c_i$) are thus included in
the set of suffixes stored in the centroid paths hanging above $c_i$. For
matching those suffixes it suffices to query the type-1 deletion tree for the
pattern $q$. Details on the implementation are below.

• We match all the suffixes obtained by insertion in the pattern right before
character $b_i$. For that, we also use the type-2 substitution tree; however,
this time the query is done on the pattern $q$ in which the character $\alpha$ is
inserted right before the character $b_i$.

As in the case of substitution, the second step of the insertion matching will
also differ for the last centroid path $C_t$. That is, in case the mismatch happens at
a non-branching character, we will just query the suffix tree for the pattern $q$
in which the character found in the centroid path is inserted right before the
mismatching character $b_t$. In case we did not have a mismatch, then the second
step is just omitted.

^3 It is easy to see that a substitution at a non-branching character does not match any suffix,
as all the suffixes in the subtrees which hang below a non-branching character contain that
non-branching character.

^4 It is easy to see that any match which can be obtained by an insertion anywhere between
two branching characters can also be obtained by insertion right before a branching character.

Query implementation The queries on the substitution and deletion trees
are similar and follow the same strategy. Each query on a correction
tree starts with a preliminary phase which consists of the following steps:

• First query the weak prefix search data structure corresponding to the tree. This returns
an interval $[i_0, j_0]$ and takes $O(1)$ time.

• Query the prefix-sum data structure for the indexes $i_0$ and $j_0$, which returns
two values $i_1$ and $j_1$. From there, we have an interval $[i_1, j_1]$. This also
takes $O(1)$ time.

• Query the 1D range color reporting data structure for the interval $[i_1, j_1]$ and
only retain the first color. The query time is also $O(1)$.

• Using this first color, we can check whether the result of the weak prefix
search is right (whether the query is a prefix of a modified suffix). To that
end, we apply to $q$ the change encoded by the color. This change depends on
the type of the query and the type of the tree:

- In case the query is for a type-1 substitution tree, the color is a pair
(char, pos) and we just substitute the character char at position pos
in $q$, getting a string $p$, on the condition that pos is above $b_i$. Otherwise
we immediately conclude that the query result is invalid and that we
do not have any match.

- In case the query is for a type-1 deletion tree, the color is also a pair
(char, pos) and we just insert the character char at position pos in $q$,
getting a string $p$, on the condition that pos is above $b_i$. Otherwise
we immediately conclude that the query result is invalid and that we
do not have any match.

- In case the query is for a type-2 substitution tree, the color is just a
character char and we just substitute the character $\alpha$ (which was at
the position of character $b_i$) with the character char, getting a string
$p$.

Then we query the suffix tree for the suffixes prefixed by $p$ with the help
of lemma 5. This returns a range of suffixes and we just check the
validity of the match for the first suffix in that range using lemma 4.

If the check at the last step of the preliminary phase fails, we deduce the
non-existence of any modified suffix prefixed by $q$ and the query terminates.
Otherwise, we deduce that there is indeed at least one modified suffix prefixed
by $q$, and we thus return all the matching suffixes which correspond to the
modified suffixes prefixed by $q$. For that, we query the 1D range color reporting data
structure again for all the colors in the interval $[i_1, j_1]$, but this time we list all the
reported colors. Then, depending on the type of the tree and the query, we do
the following:

• For a type-1 substitution or deletion tree we first check whether the
reported color (char, pos) is such that $q[pos] \ne char$. If this is not the
case, then we should not report any string, as the reported color is wrong,
meaning that all returned colors are also wrong. Otherwise, for each color
(char, pos) we do the following: as in the preliminary phase, we first
apply the modification on $q$ as indicated by the color, getting a new string $p$.
That is, for a color consisting of a pair (char, pos) we substitute or insert the
character char at position pos in $q$, getting the string $p$, and finally query
the suffix tree for the string $p$ using lemma 5.

• For a type-2 substitution tree, for each color char where $char \ne b_i$ and
$char \ne c_i$, we do the following: as in the preliminary phase, we first apply
the modification on $q$ as indicated by the color, getting a new string $p$.
That is, we substitute $\alpha$ with char. Then we finally query the suffix tree for
the prefix $p$ using lemma 6.
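
The way a reported color is turned back into a candidate string $p$ can be sketched as follows. This is only an illustration under assumptions of ours: positions are 0-based, `mismatch_pos` denotes the position of $b_i$ in $q$, "above $b_i$" is read as a position strictly smaller than `mismatch_pos`, the tree types are identified by hypothetical string tags, and for the type-2 tree only the substitution query (not the insertion query) is covered.

```python
def apply_color(q, tree_type, color, mismatch_pos):
    """Return the candidate string p encoded by a color, or None if it is invalid."""
    if tree_type == "type1_sub":
        char, pos = color
        if pos >= mismatch_pos:
            return None                      # the edit must lie above b_i
        return q[:pos] + char + q[pos + 1:]  # substitute char at position pos
    if tree_type == "type1_del":
        char, pos = color
        if pos >= mismatch_pos:
            return None                      # the edit must lie above b_i
        return q[:pos] + char + q[pos:]      # re-insert the deleted character
    if tree_type == "type2_sub":
        char = color                         # the color is a single character
        # The wildcard replaced b_i, so char goes at the mismatch position.
        return q[:mismatch_pos] + char + q[mismatch_pos + 1:]
    raise ValueError("unknown tree type: " + tree_type)
```

The string returned by `apply_color` is the one that is then searched in the suffix tree and verified, first for the single color of the preliminary phase and then for every color of the reporting phase.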

We thus have the following lemma: 

Lemma 9 Querying a correction tree takes $O(occ + 1)$ time, where $occ$ is the
number of matching suffixes.

Proof. The proof is immediate. Suppose that no matching suffix exists. Then
clearly the last step in the preliminary phase cannot succeed, as the check
cannot return true unless there exists a matching suffix. Moreover, the time for
the preliminary phase is $O(1)$. Now suppose that we have exactly $occ$ matching
suffixes. Then the preliminary phase will succeed in $O(1)$ time and the following
phase will return the matching suffixes one by one in $O(1)$ time per matching
suffix, for a total of $O(occ)$ time. ∎

Finally, theorem 2 follows immediately from lemma 9. That is, we know that we
traverse at most $\min(m, \log n)$ centroid paths and that for each centroid
path $C_i$ we need to do at most three queries to correction trees, which take in
total $O(1 + occ_i)$ time, where $occ_i$ is the number of matching suffixes which hang from
centroid path $C_i$. As the sets of matching suffixes which hang from different
centroid paths are disjoint, we deduce that the total query time is $O(m + occ)$.
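
Written out, the summation behind this bound is the following, where $t \le \min(m, \log n)$ is the number of traversed centroid paths and the disjointness of the sets of matching suffixes gives $\sum_i occ_i \le occ$:

```latex
\[
  \sum_{i=1}^{t} O(1 + occ_i)
  \;=\; O\Bigl(t + \sum_{i=1}^{t} occ_i\Bigr)
  \;=\; O\bigl(\min(m, \log n) + occ\bigr)
  \;=\; O(m + occ).
\]
```

The $O(m)$ cost of the exact and deletion matching is absorbed in the same bound.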

4.5 Solution for arbitrary pattern length 

In order to prove theorem 4 we will make use of the following lemma, also proved
in [5]:

Lemma 10 [5, section 2.3, Theorem 1] For any text $T$ of length $n$ characters
over an alphabet of size $\sigma$ we can build an index of size $O(n\log n)$ bits so that
given any pattern $p$ of length $m \ge \log^{c} n \log\log n$ (for the constant $c$ given there) we can
report all of the $occ$ substrings of the text which are at edit distance 1 from $p$ in time $O(m + occ)$.

In order to get theorem 4 we combine this lemma with theorem 2. The
combination is also straightforward. That is, we build both indexes, where the index
of theorem 2 is built using the parameter $b = \log^{c} n \log\log n$. Then, if given a
pattern of short length $m < \log^{c} n \log\log n$, we use the index of theorem 2 to
answer in time $O(m + occ)$; otherwise, given a pattern of length $m \ge \log^{c} n \log\log n$,
we use the index of lemma 10 to answer in time
$O(m + \log^{c} n \log\log n + occ) = O(m + occ)$.
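
The combination amounts to a dispatch on the pattern length, which can be sketched in Python as follows. Everything here is hypothetical scaffolding of ours: the two indexes are passed as plain callables, the function names are invented for the illustration, and the exponent `c` of the threshold is a parameter because its value comes from [5] and is not fixed by this sketch.

```python
import math


def length_threshold(n, c):
    """Threshold b = log^c(n) * log(log(n)), with logarithms taken in base 2."""
    log_n = math.log2(n)
    return log_n ** c * math.log2(log_n)


def answer(q, n, c, short_index, long_index):
    """Route the query q to the index suited to its length."""
    if len(q) < length_threshold(n, c):
        return short_index(q)   # stands for the bounded-length index (pattern length < b)
    return long_index(q)        # stands for the index for long patterns
```

For instance, with `n = 2**16` and `c = 2` the threshold evaluates to 1024, so a pattern of length 100 is routed to the first callable and a pattern of length 2000 to the second.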

5 Open problems 

An obvious open problem is whether the space-time tradeoffs achieved in this
paper can be further improved. For constant-sized alphabets the space-time
tradeoff is not too far from what is achieved for exact matching. However, for
larger alphabets the space usage seems a bit high and an improvement seems
plausible.